5000 Fastest Growing Private Companies in the U.S.

Summary statistics

Preliminary look at the Inc. 5000 Company List data set.

List of all variables in the dataframe

##  [1] "row_num"     "id"          "rank"        "workers"     "company"    
##  [6] "url"         "state_l"     "state_s"     "city"        "metro"      
## [11] "growth"      "revenue"     "industry"    "yrs_on_list"

Dimensions of the dataframe: 5000 companies observed over 14 variables listed above

## [1] 5000   14

Structure of dataframe with preview of data values

## 'data.frame':    5000 obs. of  14 variables:
##  $ row_num    : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ id         : int  22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
##  $ rank       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ workers    : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ company    : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
##  $ url        : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
##  $ state_l    : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
##  $ state_s    : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
##  $ city       : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
##  $ metro      : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
##  $ growth     : num  158957 57348 55460 26043 20690 ...
##  $ revenue    : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry   : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list: int  2 1 1 1 1 2 2 1 1 1 ...

Explore factor variables and the different levels in State and Industry

State

##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Puerto Rico"          "Rhode Island"         "South Carolina"      
## [43] "South Dakota"         "Tennessee"            "Texas"               
## [46] "Utah"                 "Vermont"              "Virginia"            
## [49] "Washington"           "West Virginia"        "Wisconsin"

Industry

The industry factor has 25 levels. Those visible elsewhere in this report include Advertising & Marketing, Business Products & Services, Consumer Products & Services, Energy, Financial Services, Food & Beverage, Health, IT Services, Software, and Telecommunications.
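
The levels of a factor column such as industry can be listed with levels(). The following is a minimal sketch, assuming the data frame is named companies (the name used later in this report); the three-level data frame here is a hypothetical stand-in, not the real data.

```r
# Hypothetical stand-in for the Inc. 5000 data frame; only the column name
# `industry` and the example level names are taken from the report.
companies <- data.frame(
  industry = factor(c("IT Services", "Software", "Health", "IT Services"))
)

# levels() returns the distinct categories of a factor
industry_levels <- levels(companies$industry)
print(industry_levels)

# nlevels() gives the number of categories
# (25 for the real industry column, per the str() output above)
print(nlevels(companies$industry))
```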

Summary of the data set

##     row_num           id             rank         workers     
##  Min.   :   0   Min.   :    4   5000   :   1   Min.   :    0  
##  1st Qu.:1250   1st Qu.:19575   4999   :   1   1st Qu.:   24  
##  Median :2500   Median :23292   4998   :   1   Median :   50  
##  Mean   :2500   Mean   :20037   4997   :   1   Mean   :  209  
##  3rd Qu.:3749   3rd Qu.:25370   4996   :   1   3rd Qu.:  125  
##  Max.   :4999   Max.   :26620   4995   :   1   Max.   :34219  
##                                 (Other):4994                  
##            company                 url             state_l    
##  (add)ventures :   1   @properties   :   1   California: 694  
##  @Properties   :   1   110-consulting:   1   Texas     : 404  
##  110 Consulting:   1   123stores     :   1   New York  : 335  
##  123Stores     :   1   180           :   1   Florida   : 303  
##  180           :   1   180fusion     :   1   Virginia  : 284  
##  180Fusion     :   1   1seocom       :   1   Illinois  : 238  
##  (Other)       :4994   (Other)       :4994   (Other)   :2742  
##     state_s            city                metro          growth         
##  CA     : 694   New York : 178   New York City: 399   Min.   :    42.45  
##  TX     : 404   Chicago  :  95   Washington DC: 316   1st Qu.:    84.21  
##  NY     : 335   Atlanta  :  94   Los Angeles  : 274   Median :   151.72  
##  FL     : 303   Austin   :  87   Chicago      : 224   Mean   :   516.44  
##  VA     : 284   San Diego:  80   Atlanta      : 194   3rd Qu.:   347.65  
##  IL     : 238   Houston  :  76   Dallas       : 169   Max.   :158956.91  
##  (Other):2742   (Other)  :4390   (Other)      :3424                      
##     revenue                                   industry     yrs_on_list    
##  Min.   :   1953000   IT Services                 : 733   Min.   : 1.000  
##  1st Qu.:   4876791   Advertising & Marketing     : 453   1st Qu.: 1.000  
##  Median :  10722077   Business Products & Services: 435   Median : 2.000  
##  Mean   :  43058182   Health                      : 377   Mean   : 2.744  
##  3rd Qu.:  26952131   Software                    : 338   3rd Qu.: 4.000  
##  Max.   :5528202691   Financial Services          : 278   Max.   :12.000  
##                       (Other)                     :2386

Initial Observations from a summary of the data set

  • There are 5000 companies ranked from 1 to 5000 based on their percentage growth in 2014, from greatest rate of growth (ranked 1) to slowest rate of growth (ranked 5000).
    • Greatest rate of growth is 158956.91%, lowest is 42.45%
  • The 51 levels for state cover 49 states (Wyoming does not appear), the District of Columbia, and one territory (Puerto Rico).
  • The minimum number of workers is 0 (need to explore further how it is possible to have no employees) and the maximum is 34219. Three quarters of the companies on the list have 125 or fewer employees.
  • The top 5 states with the greatest number of companies on the list are: California, Texas, New York, Florida, and Virginia. But the top 5 cities with the greatest number of companies on the list are: New York, Chicago, Atlanta, Austin, and San Diego. It may be worth figuring out why the top states and top cities don’t match.
  • The industries with the most companies on the list are: IT Services, Advertising & Marketing, Business Products & Services, Health, and Software.
  • For most companies, it is their first or second year on the list. About a quarter have been on the list four or more times, with 12 years being the highest number of years any one company has been on the list.
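
One possible reason the top states and top cities differ is that a state's companies can be spread across many cities, so a state can lead by total count without any of its cities leading. This is only a hypothesis; the sketch below illustrates the mechanism on a hypothetical six-row data frame that reuses the state_l and city column names.

```r
# Hypothetical rows: California's companies are spread across three cities,
# while New York's are concentrated in one.
companies <- data.frame(
  state_l = c("California", "California", "California",
              "New York", "New York", "Texas"),
  city    = c("San Diego", "Los Angeles", "San Francisco",
              "New York", "New York", "Austin"),
  stringsAsFactors = TRUE
)

# Count companies per state and per city, largest first
state_counts <- sort(table(companies$state_l), decreasing = TRUE)
city_counts  <- sort(table(companies$city), decreasing = TRUE)

top_state <- names(state_counts)[1]
top_city  <- names(city_counts)[1]
print(top_state)
print(top_city)
```

In this toy data the leading state and the leading city belong to different states, which is the same kind of mismatch seen in the summary.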

Univariate Plots Section

Histogram of states where companies are located

Histograms of workers by count

The first plot’s binwidth is too large to show the trend. Reducing the binwidth reveals a histogram that skews right. What happens to the distribution if I perform a log10 transformation?

Transforming the long tail by taking the log10 of workers helps better understand the distribution of workers. The transformed workers distribution looks close to a normal distribution with a longer tail on the right.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0      24      50     209     125   34220
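
Because the minimum number of workers is 0 and log10(0) is -Inf, zero-worker rows need to be dropped (or offset) before a log10 histogram is binned. The sketch below uses base R rather than the plotting code behind this report, which is not shown; the worker counts are hypothetical values chosen to span the range in the summary.

```r
# Hypothetical worker counts spanning the range in the summary (0 to 34219)
workers <- c(0, 11, 24, 50, 62, 92, 125, 227, 264, 34219)

# log10(0) is -Inf, so keep only strictly positive counts before transforming
log_workers <- log10(workers[workers > 0])

# Bin the transformed values without drawing; one bin per 0.5 log10 units
h <- hist(log_workers, breaks = seq(1, 5, by = 0.5), plot = FALSE)
print(h$counts)
```

On the log10 scale a bin width of 0.5 covers roughly a threefold range of workers, so the long right tail is compressed into a few bins.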

Distribution of industry

The top industry is IT Services with 733 companies, the most represented industry by a large margin. The next two industries by number of companies are Advertising & Marketing (453) and Business Products & Services (435), each roughly 60% of the IT Services count.

Distribution of revenue

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

The revenue distribution is really skewed right with a very long tail. A log10 transformation and adjusting bin width provides a more natural way to see revenue data and illustrate trends in the data. However, even after a log10 transformation, the data is still skewed to the right. Removing extremely high revenue outliers helps show a more normal distribution.

Part of the reason the distribution doesn’t look entirely normal is that the log-normal distribution looks truncated on the left side. This is likely because the dataset contains only the top 5000 companies. If the data extended to 10,000 companies, for example, the curve would likely look more normally distributed.

Distribution of growth

The long tail skewing to the right justifies a log10 transformation.

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     42.45     84.21    151.70    516.40    347.70 159000.00

The distributions of growth and revenue look really similar. Let’s try another type of plot to tease apart how the distributions differ. The frequency polygon plot better shows the different shapes of the distributions. The amount of growth is based on revenue generated, so it is not surprising that the two distributions are similar: they are highly correlated.

Distribution of Growth vs. Revenue

## Loading required package: grid

Many of the highest ranked companies are small businesses. This could be because smaller companies grow faster than big public companies. But it could also be that smaller companies are starting with smaller amounts of revenue. Absolute growth in dollars is different from percentage growth. For example, a company with no revenue the previous year that gains some revenue the next year has infinite percentage growth. But this isn’t a good reflection of how much revenue the company is generating compared to another company that’s making more in absolute revenue but has a lower percentage growth.

I created two new variables: revenue 2013, last year’s revenue derived from current revenue and percentage growth, and growth in dollars, which is revenue 2013 subtracted from revenue 2014.

## [1] 123000 143853 153125 135000 373500 690697
## [1] 195517000  82496710  84923377  35158000  77278860 137286506
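
The derivation can be sketched as follows. The formula is an assumption (the code behind this report is not shown): it treats growth as a percentage increase over the previous year, so that revenue 2014 = revenue 2013 × (1 + growth / 100). Applied to the first three growth and revenue values printed earlier, it agrees with the first values of the two outputs above to within rounding.

```r
# First three rows of growth (percent) and revenue, as printed earlier in the report
growth  <- c(158956.91, 57347.92, 55460.16)
revenue <- c(195640000, 82640563, 85076502)

# Assumed relationship: revenue2014 = revenue2013 * (1 + growth / 100),
# so last year's revenue is recovered by dividing out the growth factor
revenue2013 <- revenue / (1 + growth / 100)

# Growth in dollars is the difference between the two years
growth_dollar <- revenue - revenue2013

print(round(revenue2013))
print(round(growth_dollar))
```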

Population Dataset

There is a limitation in my data set. Without resident population data for each state, city, or metro area, it is hard to determine whether the states with the most fast-growing companies have them simply because more people live there or because something about the state fosters growth. Therefore, I looked for population data from the U.S. Census Bureau and found population estimates for 2010 to 2014. These pair with the 2014 company data and the 2013 revenue and growth numbers I reverse engineered.

The structure of the new dataset of state population data:

##   Geographic_Area Census_April1 Estimate_Base    Est_2010    Est_2011
## 1   United States   308,745,538   308,758,105 309,347,057 311,721,632
## 2       Northeast    55,317,240    55,318,348  55,381,690  55,635,670
## 3         Midwest    66,927,001    66,929,898  66,972,390  67,149,657
## 4           South   114,555,744   114,562,951 114,871,231 116,089,908
## 5            West    71,945,553    71,946,908  72,121,746  72,846,397
## 6         Alabama     4,779,736     4,780,127   4,785,822   4,801,695
##      Est_2012    Est_2013    Est_2014
## 1 314,112,078 316,497,531 318,857,056
## 2  55,832,038  56,028,220  56,152,333
## 3  67,331,458  67,567,871  67,745,108
## 4 117,346,322 118,522,802 119,771,934
## 5  73,602,260  74,378,638  75,187,681
## 6   4,817,484   4,833,996   4,849,377

Using dplyr, I can create a new dataset that aggregates all growth and revenue numbers for companies by state and calculates the growth per capita.

The variables and structure of the new dataset.

## [1] "state_l"              "state_growth_dollar"  "state_population2014"
## [4] "growth_per_capita"
## 'data.frame':    51 obs. of  4 variables:
##  $ state_l             : Factor w/ 51 levels "Alabama","Alaska",..: 5 45 14 33 36 48 10 22 23 44 ...
##  $ state_growth_dollar : num  18309472149 17237538957 7298472080 6272947980 6133911042 ...
##  $ state_population2014: num  38802500 26956958 12880580 19746227 11594163 ...
##  $ growth_per_capita   : num  472 639 567 318 529 ...
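
The aggregation can be sketched in base R as below. This is a stand-in for the dplyr pipeline, which is not shown in this report: aggregate() plays the role of grouping and summing, and merge() the role of the join. The per-company growth_dollar values are hypothetical; the two population figures are the first two state_population2014 values printed above (California and Texas).

```r
# Hypothetical per-company dollar growth; only the column names follow the report
companies <- data.frame(
  state_l       = c("California", "California", "Texas", "Texas", "Texas"),
  growth_dollar = c(4e6, 6e6, 1e6, 2e6, 3e6)
)

# 2014 population estimates for the two states, taken from the str() output above
population <- data.frame(
  state_l              = c("California", "Texas"),
  state_population2014 = c(38802500, 26956958)
)

# Sum growth_dollar within each state (base R stand-in for group_by + summarise)
state_growth <- aggregate(growth_dollar ~ state_l, data = companies, FUN = sum)
names(state_growth)[2] <- "state_growth_dollar"

# Join on the state name, then normalize total growth by population
state_growth <- merge(state_growth, population, by = "state_l")
state_growth$growth_per_capita <-
  state_growth$state_growth_dollar / state_growth$state_population2014

print(state_growth)
```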

Univariate Analysis

What is the structure of your dataset?

I have two datasets. The original dataset is a list of the 5000 fastest growing private companies in 2014 in the U.S. from Inc. 5000. The second dataset I have is state population data from the Census Bureau. I have two resulting data frames: companies is the Inc. 5000 data set with new variables added, and state_growth is population data with additional variables.

What is/are the main feature(s) of interest in your dataset?

The variables most interesting to explore are the growth in percentage and dollar amounts since the dataset from Inc. 5000 is specifically about the fastest growing private companies in the U.S. I am also very interested in the industry the companies are in.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Revenue will be an important way to understand growth. For example, a company with a small revenue will see greater gains in percentage growth than a company with a larger revenue, but the latter could have much greater revenue and growth in absolute dollar amounts. So it is critical to interpret growth in light of revenue.

State population data is also important to better understand growth. A larger state might appear to have greater growth in absolute dollar amounts but that could be influenced by a greater population. Therefore investigating growth per capita can provide a fairer way to look at growth, especially from the point of view of smaller states.

Did you create any new variables from existing variables in the dataset?

I created 4 new variables from existing variables across two datasets. I created two new variables in the companies data frame: 1. revenue2013, 2. growth_dollar. I reverse engineered revenue from 2013 using revenue from 2014 and percentage growth. Then I subtracted the 2013 revenue from 2014 revenue to get the growth_dollar.

I also created a new dataframe using the state population data from the census. In this dataframe, I added two other variables: 3. state_growth_dollar and 4. growth_per_capita. state_growth_dollar was calculated by grouping companies by state and summing growth_dollar, the 2nd variable I created. The growth_per_capita variable was created by dividing state_growth_dollar by the state population.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The revenue, growth, and workers histograms all skewed right with a very long tail. I had to perform a log transformation to better understand the data. I performed a lot of tidying and adjusting to import and join the two data frames, including converting the population data to numeric because the commas separating the thousands places were causing the read.csv() command to import population numbers as characters. I needed population numbers to be numeric so I could perform division to calculate the growth_per_capita.
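
One common way to do this conversion is to strip the commas with gsub() before calling as.numeric(). The sketch below is illustrative rather than the exact code used for this report; the three strings are 2014 estimates copied from the population table above.

```r
# Population figures as they appear in the Census table (character, with commas)
est_2014 <- c("318,857,056", "56,152,333", "4,849,377")

# Strip the thousands separators, then convert to numeric.
# Calling as.numeric() directly on these strings would give NA with a warning.
est_2014_num <- as.numeric(gsub(",", "", est_2014, fixed = TRUE))
print(est_2014_num)
```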

Bivariate Plots Section

The structure of the two datasets: 1. State population and aggregate growth of companies 2. 5000 fastest growing companies and the attributes that describe them

Scatter Matrix plots to understand the relationships between variables in the two datasets

## 'data.frame':    5000 obs. of  16 variables:
##  $ row_num          : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ id               : int  22890 25747 25643 26098 26182 22913 22937 25413 26079 25861 ...
##  $ rank             : Ord.factor w/ 5000 levels "5000"<"4999"<..: 5000 4999 4998 4997 4996 4995 4994 4993 4992 4991 ...
##  $ workers          : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ company          : Factor w/ 5000 levels "(add)ventures",..: 1725 3569 3651 4211 79 3520 1094 3357 4703 1826 ...
##  $ url              : Factor w/ 5000 levels "@properties",..: 1725 3569 3647 4211 76 3520 1094 3357 4703 1826 ...
##  $ state_l          : Factor w/ 51 levels "Alabama","Alaska",..: 5 5 48 5 22 20 5 3 38 34 ...
##  $ state_s          : Factor w/ 51 levels "AK","AL","AR",..: 5 5 47 5 20 22 5 4 38 28 ...
##  $ city             : Factor w/ 1352 levels "Acton","Ada",..: 355 355 37 930 737 46 1142 1103 979 1325 ...
##  $ metro            : Factor w/ 326 levels "","Adrian MI",..: 171 171 314 262 39 163 261 226 229 321 ...
##  $ growth_percentage: num  158957 57348 55460 26043 20690 ...
##  $ revenue2014      : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry         : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list      : int  2 1 1 1 1 2 2 1 1 1 ...
##  $ revenue2013      : num  123000 143853 153125 135000 373500 ...
##  $ growth_dollar    : num  195517000 82496710 84923377 35158000 77278860 ...
## 'data.frame':    5000 obs. of  7 variables:
##  $ workers          : int  227 191 145 62 92 50 129 130 264 11 ...
##  $ growth_percentage: num  158957 57348 55460 26043 20690 ...
##  $ revenue2014      : num  195640000 82640563 85076502 35293000 77652360 ...
##  $ industry         : Factor w/ 25 levels "Advertising & Marketing",..: 5 11 2 23 24 7 13 13 25 7 ...
##  $ yrs_on_list      : int  2 1 1 1 1 2 2 1 1 1 ...
##  $ revenue2013      : num  123000 143853 153125 135000 373500 ...
##  $ growth_dollar    : num  195517000 82496710 84923377 35158000 77278860 ...
##   workers growth_percentage revenue2014                     industry
## 1     227         158956.91   195640000 Consumer Products & Services
## 2     191          57347.92    82640563              Food & Beverage
## 3     145          55460.16    85076502 Business Products & Services
## 4      62          26042.96    35293000                     Software
## 5      92          20690.46    77652360           Telecommunications
## 6      50          19876.52   137977203                       Energy
##   yrs_on_list revenue2013 growth_dollar
## 1           2      123000     195517000
## 2           1      143853      82496710
## 3           1      153125      84923377
## 4           1      135000      35158000
## 5           1      373500      77278860
## 6           2      690697     137286506

## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0

## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0

## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0
## Warning in loop_apply(n, do.ply): Stacking not well defined when ymin != 0

Which state has greatest revenue growth per capita in 2014?

The state growth in dollars shows most states clustered in the same area under $7.5 billion. However, there are 2 states, California and Texas, with an extremely large amount of growth at $17-18 billion. But when looking at state growth in dollars per capita, the top two states are Virginia and Colorado, with Texas trailing closely behind Colorado. How does population affect growth? Future plots should explore the relationship between state population and revenue as well as state population and growth to uncover other trends.

[Plot: Revenue growth by state, normalized by population]

As concluded earlier, Virginia, Colorado, and Texas have the greatest growth in dollars per capita. California is trailing at #13.

What’s the relationship between Revenue and Growth?

Note: Refer to the frequency polygon and density plots in the Univariate section to see the differences in distribution between revenue in 2014 and percentage growth.

Revenue in 2014 and growth in dollars are strongly correlated, with a Pearson’s r of 0.95.

## 
##  Pearson's product-moment correlation
## 
## data:  companies$revenue2014 and companies$growth_dollar
## t = 208.4471, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9440788 0.9498019
## sample estimates:
##       cor 
## 0.9470155

A strong correlation also exists between revenue in 2013 and growth in dollars, with a Pearson’s r of 0.77. However, this relationship is weaker than the one between revenue in 2014 and growth. That is expected since the dataset focuses on the fastest growing companies in 2014, and partly by construction: growth_dollar is computed as revenue2014 minus revenue2013, so it moves closely with revenue2014 whenever growth is large. Growth in 2014 is also clearly tied to revenue in 2013, hence the relationship between growth and revenue in 2013 is expectedly high.

## 
##  Pearson's product-moment correlation
## 
## data:  companies$revenue2013 and companies$growth_dollar
## t = 84.3823, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7548497 0.7777247
## sample estimates:
##       cor 
## 0.7665302
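
The output blocks above have the form produced by cor.test() called on two numeric vectors. A minimal sketch on synthetic data follows; the log-normal parameters and sample size are arbitrary assumptions, so the printed estimates will not match the values reported for the real data. Only the variable names and the way growth_dollar is built follow this report.

```r
set.seed(1)
n <- 200

# Synthetic, hypothetical revenues and percentage growth (not the real data)
revenue2013 <- rlnorm(n, meanlog = 14, sdlog = 1)
growth_pct  <- rlnorm(n, meanlog = 5, sdlog = 1)

# Same construction as in the report: apply growth, then take the difference
revenue2014   <- revenue2013 * (1 + growth_pct / 100)
growth_dollar <- revenue2014 - revenue2013

# Pearson correlation tests (cor.test defaults to method = "pearson")
test_2014 <- cor.test(revenue2014, growth_dollar)
test_2013 <- cor.test(revenue2013, growth_dollar)

print(test_2014$estimate)
print(test_2013$estimate)
```

The degrees of freedom reported by cor.test() are n - 2, which is why the tests on the 5000-row data frame show df = 4998.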

Let’s plot revenue and growth percentage to see the trends.

## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

## Scale for 'x' is already present. Adding another scale for 'x', which will replace the existing scale.

## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).

## 
##  Pearson's product-moment correlation
## 
## data:  companies$yrs_on_list and companies$revenue2014
## t = 10.7641, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1233182 0.1775016
## sample estimates:
##      cor 
## 0.150523
## 
##  Pearson's product-moment correlation
## 
## data:  companies$yrs_on_list and companies$revenue2013
## t = 12.0087, df = 4998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1403959 0.1942809
## sample estimates:
##       cor 
## 0.1674635

## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_boxplot).

## Warning in loop_apply(n, do.ply): Removed 37 rows containing non-finite
## values (stat_ydensity).

## Warning in loop_apply(n, do.ply): Removed 124 rows containing missing
## values (geom_point).

## Warning in loop_apply(n, do.ply): Removed 31 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 64 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 21 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 69 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 469 rows containing missing
## values (geom_point).

## Warning in loop_apply(n, do.ply): Removed 65 rows containing missing values
## (geom_point).

## Warning in loop_apply(n, do.ply): Removed 362 rows containing missing
## values (geom_point).

geom_boxplot, geom_point, geom_violin, geom_jitter with geom_rug, geom_point(stat = 'summary'), geom_bin2d, geom_tile, geom_density2d, geom_point(alpha = 1/10, color = 'gray') + geom_line(stat = 'summary', fun.y = median), geom_point(alpha = 1/10, color = 'gray') + geom_step(stat = 'summary', fun.y = median)

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three


Reflection